Codee: A Tensor Embedding Scheme for Binary Code Search

Abstract

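Given a target binary function, binary code search retrieves the top-$K$ similar functions in a repository, where similar functions are those compiled from the same source code. Searching binary code is particularly challenging due to the large variations in compiler tool-chains, compiler options, and CPU architectures, as well as the thousands of binary codes to be searched. Furthermore, there are some pivotal issues in current binary code search schemes, including inaccurate text-based or token-based analysis, slow graph matching, and complex deep learning processes. In this paper, we present an unsupervised tensor embedding scheme, Codee, to carry out code search efficiently and accurately at the binary function level. First, we use an NLP-based neural network to generate semantic-aware token embeddings. Second, we propose an efficient basic block embedding generation algorithm based on a network representation learning model. We learn both the semantic information of instructions and the structural information of the control flow to generate the basic block embeddings. Then we use all basic block embeddings in a function to obtain a variable-length function feature vector. Third, we build a tensor and generate function embeddings based on tensor singular value decomposition, which compresses the variable-length vectors into short fixed-length vectors to facilitate efficient search afterward. We further propose a dynamic tensor compression algorithm to incrementally update the function embedding database. Finally, we use a locality-sensitive hashing (LSH) method to find the top-$K$ similar matching functions in the repository. Compared with state-of-the-art cross-optimization-level code search schemes, such as Asm2Vec and DeepBinDiff, our scheme achieves higher average search accuracy, shorter feature vectors, and faster feature generation performance on four datasets: OpenSSL, Coreutils, libgmp, and libcurl. Compared with other cross-platform and cross-optimization-level code search schemes, such as Gemini and Safe, the average recall of our method also outperforms the others.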
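The final retrieval step described in the abstract (hashing fixed-length function embeddings and looking up the top-$K$ matches) can be illustrated with a minimal sketch. The abstract does not say which LSH family Codee uses, so the code below assumes random-hyperplane hashing with cosine re-ranking. The names `HyperplaneLSH`, `n_bits`, `n_tables`, and the toy function names are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HyperplaneLSH:
    """Toy random-hyperplane LSH index (an assumption, not Codee's actual index)."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = random.Random(seed)
        # One set of n_bits random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.vectors = {}

    @staticmethod
    def _key(planes, vec):
        # Each bit records on which side of a hyperplane the vector falls.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes
        )

    def add(self, name, vec):
        """Store an embedding under a function name in every table."""
        self.vectors[name] = vec
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, vec)].append(name)

    def query(self, vec, k):
        """Return up to k candidate names, ranked by cosine similarity."""
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._key(planes, vec), []))
        ranked = sorted(
            candidates,
            key=lambda n: cosine(vec, self.vectors[n]),
            reverse=True,
        )
        return ranked[:k]


if __name__ == "__main__":
    index = HyperplaneLSH(dim=4, n_bits=4, n_tables=3, seed=1)
    index.add("func_a", [1.0, 0.0, 0.0, 0.0])
    index.add("func_b", [0.0, 1.0, 0.0, 0.0])
    print(index.query([2.0, 0.0, 0.0, 0.0], k=1))
```

Only vectors that share a bucket with the query in at least one table are re-ranked, which is why this kind of lookup can be faster than an exhaustive scan. The trade-off is that a true neighbor can be missed. Adding tables raises recall, and adding bits per table shrinks the candidate set.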

Similar Resources

A Non-MDS Erasure Code Scheme for Storage Applications

This paper investigates the use of redundancy and self repairing against node failures in distributed storage systems using a novel non-MDS erasure code. In replication method, access to one replication node is adequate to reconstruct a lost node, while in MDS erasure coded systems which are optimal in terms of redundancy-reliability tradeoff, a single node failure is repaired after recovering the ...

Quadra-Embedding: Binary Code Embedding with Low Quantization Error

Thanks to compact data representations and fast similarity computation, many binary code embedding techniques have been proposed for large-scale similarity search used in many computer vision applications including image retrieval. Most prior techniques have centered around optimizing a set of projections for accurate embedding. In spite of active research efforts, existing solutions suffer fro...

Towards Optimal Binary Code Learning via Ordinal Embedding

Binary code learning, a.k.a., hashing, has been recently popular due to its high efficiency in large-scale similarity search and recognition. It typically maps high-dimensional data points to binary codes, where data similarity can be efficiently computed via rapid Hamming distance. Most existing unsupervised hashing schemes pursue binary codes by reducing the quantization error from an origina...

Ordinal Constrained Binary Code Learning for Nearest Neighbor Search

Recent years have witnessed extensive attention to binary code learning, a.k.a. hashing, for nearest neighbor search problems. It has been seen that high-dimensional data points can be quantized into binary codes to give an efficient similarity approximation via Hamming distance. Among existing schemes, ranking-based hashing is a recent and promising approach that targets preserving ordinal relations of ra...

Embedding in a perfect code

For any 1-error-correcting binary code $C$ of length $m$ we will construct a 1-perfect binary code $P(C)$ of length $n = 2^m - 1$ such that fixing the last $n - m$ coordinates by zeroes in $P(C)$ gives $C$. In particular, any complete or partial Steiner triple system (or any other system that forms a 1-code) can always be embedded in a 1-perfect code of some length (compare with [13]). Since the weight-3 wor...

Journal

Journal title: IEEE Transactions on Software Engineering

Year: 2022

ISSN: 0098-5589, 1939-3520, 2326-3881

DOI: https://doi.org/10.1109/tse.2021.3056139